feat[vortex-cuda]: GPU FSST decompression kernel #7776

Open

asubiotto wants to merge 1 commit into develop from asubiotto/fsst-cuda

Conversation

@asubiotto
Contributor

Summary

This commit implements on-GPU decompression of the existing FSST encoding. The kernel achieves ~42% of the peak throughput measured by the throughput_cuda benchmark on a DGX Spark. CPU work is required to compute the output offsets.

The core performance win is buffering up to 24 bytes of decompressed data in three u64 registers and emitting the widest aligned stores possible up to u128 (st.global.v2.u64).

The 256-entry symbol table (≤ 2 KB) is read directly from global memory. Staging it in shared memory benchmarked ~3% slower at 10M rows and ~15% slower at 1M rows. The hypothesis is that L1 already holds the table after a few iterations, and the explicit shared copy adds bank-conflict latency on the warp-divergent symbols[code] reads; the gap is wider at 1M because the kernel is less bandwidth-bound there.
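
For reference, a minimal CPU-side sketch of the per-string decode loop the kernel parallelizes (one thread per string). The names and table layout are illustrative rather than the kernel's actual interface, and it assumes the standard FSST convention where code 255 escapes a literal byte:

fn decode_string(
    codes: &[u8],            // compressed bytes of one string
    symbols: &[u64; 256],    // symbol bytes, packed little-endian
    symbol_lens: &[u8; 256], // decoded length of each symbol (1..=8)
    out: &mut Vec<u8>,
) {
    let mut i = 0;
    while i < codes.len() {
        let code = codes[i] as usize;
        if code == 255 {
            // Escape: the next compressed byte is emitted verbatim.
            i += 1;
            out.push(codes[i]);
        } else {
            // The kernel instead accumulates these bytes in three u64
            // registers and flushes with the widest aligned store available
            // (up to st.global.v2.u64); here we simply append them.
            let len = symbol_lens[code] as usize;
            out.extend_from_slice(&symbols[code].to_le_bytes()[..len]);
        }
        i += 1;
    }
}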

Further optimizations would require an encoding change. Splits-style intra-string parallelism (one GPU thread per ~32-compressed-byte chunk instead of per-string) was prototyped on top of this kernel and measured an additional +30% kernel throughput at 1M clickbench URLs, +26% at 5M, +12% at 10M.

Four kernel variants are generated for the unsigned widths of codes_offsets (u8/u16/u32/u64); signed integer ptypes are reinterpreted as their unsigned equivalents on the Rust side, so the bit pattern is preserved without copying.
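
To illustrate the dispatch, a hedged sketch of the width mapping; the enum and the non-u32 kernel names are hypothetical, extrapolated from the fsst_u32 name that appears in the ncu run below:

// Hypothetical mapping, for illustration only: offsets are non-negative,
// so a signed ptype can reuse the unsigned kernel of the same width
// without copying.
enum OffsetPType { U8, I8, U16, I16, U32, I32, U64, I64 }

fn kernel_for(ptype: OffsetPType) -> &'static str {
    use OffsetPType::*;
    match ptype {
        U8 | I8 => "fsst_u8",
        U16 | I16 => "fsst_u16",
        U32 | I32 => "fsst_u32",
        U64 | I64 => "fsst_u64",
    }
}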

Addresses: #6538

Testing

Unit tests against the CPU implementation on small and larger datasets.

Comment thread vortex-cuda/src/kernel/encodings/fsst.rs Outdated
@asubiotto asubiotto added the changelog/performance (A performance improvement) label on May 4, 2026
@codspeed-hq

codspeed-hq Bot commented May 4, 2026

Merging this PR will degrade performance by 10.6%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

❌ 1 regressed benchmark
✅ 1168 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode         Benchmark                             BASE       HEAD       Efficiency
Simulation   bitwise_not_vortex_buffer_mut[128]    246.1 ns   275.3 ns   -10.6%

Comparing asubiotto/fsst-cuda (bebed3a) with develop (903ee6c)

Open in CodSpeed

@asubiotto asubiotto force-pushed the asubiotto/fsst-cuda branch from c982cd8 to 007bdab on May 4, 2026 12:29
@asubiotto asubiotto added the changelog/feature (A new feature) label and removed the changelog/performance (A performance improvement) label on May 4, 2026
@asubiotto asubiotto requested review from 0ax1 and robert3005 on May 4, 2026 12:33
@asubiotto asubiotto force-pushed the asubiotto/fsst-cuda branch 3 times, most recently from a10b28e to 79d1f10 on May 4, 2026 12:44
Comment thread vortex-cuda/src/kernel/encodings/fsst.rs Outdated
Comment thread vortex-cuda/kernels/src/fsst.cu Outdated
Comment thread vortex-cuda/kernels/src/fsst.cu Outdated
Comment thread vortex-cuda/kernels/src/fsst.cu Outdated
@a10y
Contributor

a10y commented May 4, 2026

Very cool!

@asubiotto asubiotto force-pushed the asubiotto/fsst-cuda branch from 79d1f10 to 46670e8 on May 5, 2026 10:57
@asubiotto
Contributor Author

Thanks for the review! Addressed the comments.

Contributor

@0ax1 0ax1 left a comment

Fantastic contribution, one question inline.

Comment thread vortex-cuda/src/kernel/encodings/fsst.rs Outdated
Comment thread vortex-cuda/kernels/src/fsst.cu Outdated
Comment thread vortex-cuda/kernels/src/fsst.cu Outdated
Comment thread vortex-cuda/kernels/src/fsst.cu

// Prefix-sum lens to per-string u32 output offsets so the kernel
// knows where to write each decoded string.
#[expect(clippy::cast_possible_truncation)]
Contributor

This looks unsound to me

Contributor Author

Yes, this is leftover from the hackathon. Ignore. I will template on output offsets.

Contributor Author

Or just use u64, but that'd be a little wasteful.

Contributor Author

I chose to just use u64, not sure templating is worth it. LMK and I can change the approach.

Comment on lines +90 to +99
let output_offsets: Vec<u32> = match_each_integer_ptype!(lens.ptype(), |P| {
    let mut out = Vec::with_capacity(lens.len() + 1);
    let mut acc: usize = 0;
    out.push(0u32);
    for &l in lens.as_slice::<P>() {
        acc += l as usize;
        out.push(acc as u32);
    }
    out
});
Contributor

would be nice if this was computed in the kernel too?

Contributor Author

How would you suggest to do this? The kernel needs to know the output offsets in order to decode and I think the only thing that would work would be to do an O(n^2) prefix sum of all lengths for every string. I think it's probably cheaper to execute this linear computation with dependencies on the CPU. A GSST encoding would obviate the need for this so I would just punt on this.

@asubiotto asubiotto force-pushed the asubiotto/fsst-cuda branch from 46670e8 to bebed3a on May 5, 2026 13:18
@asubiotto
Contributor Author

Updated to use u64 output offsets and cleaned up the kernel parameters into an args struct.
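
Roughly what such an args struct might look like; this is an illustrative placeholder, not the actual definition in the PR:

// Illustrative only: a single #[repr(C)] struct bundles the kernel
// parameters instead of passing a long positional argument list.
#[repr(C)]
struct FsstDecompressArgs {
    codes: *const u8,           // concatenated compressed bytes
    codes_offsets: *const u8,   // per-string offsets (one of the four widths)
    symbols: *const u64,        // 256-entry symbol table
    symbol_lens: *const u8,
    output: *mut u8,
    output_offsets: *const u64, // u64 output offsets computed on the CPU
    num_strings: u64,
}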

@0ax1
Contributor

0ax1 commented May 5, 2026

Gave the benchmark a spin also on my end on a GH200 (which has 4 TB/s memory bandwidth):

  CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_RUNNER="sudo ncu \
    --kernel-name fsst_u32 \
    --launch-skip 2 \
    --launch-count 1 \
    --metrics \
  smsp__thread_inst_executed_per_inst_executed,\
  smsp__thread_inst_executed_pred_on_per_inst_executed,\
  smsp__sass_average_branch_targets_threads_uniform,\
  smsp__sass_branch_targets_threads_uniform,\
  smsp__sass_branch_targets_threads_divergent,\
  smsp__sass_inst_executed_op_branch,\
  smsp__warp_issue_stalled_branch_resolving_per_warp_active,\
  smsp__average_warps_issue_stalled_branch_resolving_per_issue_active,\
  smsp__warps_eligible,\
  smsp__issue_active,\
  smsp__inst_issued_per_issue_active" \
  cargo bench -p vortex-cuda --bench fsst_cuda --target aarch64-unknown-linux-gnu -- cuda/fsst/decompress/10M


---------------------------------------------------------------------------- ----------- -------------
Metric Name                                                                  Metric Unit  Metric Value
---------------------------------------------------------------------------- ----------- -------------
smsp__average_warps_issue_stalled_branch_resolving_per_issue_active.max_rate        inst             1
smsp__average_warps_issue_stalled_branch_resolving_per_issue_active.pct                %         84.03
smsp__average_warps_issue_stalled_branch_resolving_per_issue_active.ratio           inst          0.84
smsp__inst_issued_per_issue_active.max_rate                                   inst/cycle             1
smsp__inst_issued_per_issue_active.pct                                                 %           100
smsp__inst_issued_per_issue_active.ratio                                      inst/cycle             1
smsp__issue_active.avg                                                             cycle  4,200,828.18
smsp__issue_active.max                                                             cycle     4,341,189
smsp__issue_active.min                                                             cycle     4,062,625
smsp__issue_active.sum                                                             cycle 2,218,037,277
smsp__sass_average_branch_targets_threads_uniform.max_rate                                           1
smsp__sass_average_branch_targets_threads_uniform.pct                                  %         55.52
smsp__sass_average_branch_targets_threads_uniform.ratio                                           0.56
smsp__sass_branch_targets_threads_divergent.avg                                             124,624.90
smsp__sass_branch_targets_threads_divergent.max                                                162,591
smsp__sass_branch_targets_threads_divergent.min                                                 80,209
smsp__sass_branch_targets_threads_divergent.sum                                             65,801,947
smsp__sass_branch_targets_threads_uniform.avg                                               155,567.00
smsp__sass_branch_targets_threads_uniform.max                                                  203,499
smsp__sass_branch_targets_threads_uniform.min                                                   99,768
smsp__sass_branch_targets_threads_uniform.sum                                               82,139,378
smsp__sass_inst_executed_op_branch.avg                                              inst    280,191.90
smsp__sass_inst_executed_op_branch.max                                              inst       365,698
smsp__sass_inst_executed_op_branch.min                                              inst       180,368
smsp__sass_inst_executed_op_branch.sum                                              inst   147,941,325
smsp__thread_inst_executed_per_inst_executed.max_rate                                               32
smsp__thread_inst_executed_per_inst_executed.pct                                       %         32.40
smsp__thread_inst_executed_per_inst_executed.ratio                                               10.37
smsp__thread_inst_executed_pred_on_per_inst_executed.max_rate                                       32
smsp__thread_inst_executed_pred_on_per_inst_executed.pct                               %         25.77
smsp__thread_inst_executed_pred_on_per_inst_executed.ratio                                        8.25
smsp__warp_issue_stalled_branch_resolving_per_warp_active.max_rate                                   1
smsp__warp_issue_stalled_branch_resolving_per_warp_active.pct                          %          7.53
smsp__warp_issue_stalled_branch_resolving_per_warp_active.ratio                                   0.08
smsp__warps_eligible.avg                                                            warp 18,937,555.73
smsp__warps_eligible.max                                                            warp    19,878,429
smsp__warps_eligible.min                                                            warp    18,067,927
smsp__warps_eligible.sum                                                            warp 9,999,029,425
---------------------------------------------------------------------------- ----------- -------------

cuda/cuda/fsst/decompress/10M
                        time:   [2.9651 ms 2.9839 ms 3.0055 ms]
                        thrpt:  [286.19 GiB/s 288.26 GiB/s 290.08 GiB/s]
                 change:
                        time:   [-12.955% -12.300% -11.621%] (p = 0.00 < 0.05)
                        thrpt:  [+13.149% +14.025% +14.883%]
                        Performance has improved.

So the main bottleneck for this kernel is warp divergence:

smsp__sass_average_branch_targets_threads_uniform.pct 55.52%

For comparison we e.g. get

  cuda/cuda/runend_100runs/dispatch_u32/10M
  time:   [26.832 µs 29.795 µs 35.405 µs]
  thrpt:  [1052.2 GiB/s 1250.3 GiB/s 1388.4 GiB/s]

  cuda/cuda/runend_100runs/dispatch_u32/100M
  time:   [159.44 µs 160.21 µs 161.11 µs]
  thrpt:  [2312.2 GiB/s 2325.2 GiB/s 2336.4 GiB/s]

for runend.


use crate::timed_launch_strategy::TimedLaunchStrategy;

const BENCH_SIZES: &[(usize, &str)] = &[(1_000_000, "1M"), (5_000_000, "5M"), (10_000_000, "10M")];
Contributor

At this point we run all benchmarks for vortex-cuda with 100M and re-use the constant from vortex-cuda/benches/bench_config/mod.rs. 10M inputs proved to be too noisy when running the benchmarks in codspeed and can't saturate larger GPUs like a GH200.

Contributor Author

Yeah the difference is that with FSST each input is a clickbench URL. Happy to use 100M if you want, but I think that's around 10GB.

Contributor

Yeah fair, we can keep 10M but let's add comment on the constant why it's diff here or so.
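
One possible wording for that comment, using the figures mentioned in this thread (suggestion only):

/// Unlike the other vortex-cuda benches, which use 100M scalar elements
/// (see vortex-cuda/benches/bench_config/mod.rs), each FSST input here is a
/// ClickBench URL, so 10M rows is already on the order of a gigabyte and
/// 100M would be roughly 10 GB.
const BENCH_SIZES: &[(usize, &str)] = &[(1_000_000, "1M"), (5_000_000, "5M"), (10_000_000, "10M")];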

Contributor

Oh we also need to wire the bench into codspeed in .github/workflows/codspeed.yml. We can put it next to zstd, which is nice for comparison:

- { shard: 7, name: "Encodings 4", packages: "vortex-sparse vortex-zigzag vortex-zstd" }

@asubiotto
Contributor Author

> Gave the benchmark a spin also on my end on a GH200 (which has 4 TB/s memory bandwidth): [...] So the main bottleneck for this kernel is warp divergence: smsp__sass_average_branch_targets_threads_uniform.pct 55.52%

Nice, wish I had one of those lying around 😂 Yes, that's the main idea behind the GSST splits. Precomputing splits on the CPU showed +30% kernel throughput at 1M clickbench URLs, +26% at 5M, +12% at 10M on the DGX spark. While we can precompute splits I think maybe it's a little too much CPU prep and this is rather a motivation for the GSST encoding? Happy to introduce splits in a follow up PR if we want to do that.

@0ax1
Contributor

0ax1 commented May 5, 2026

> Nice, wish I had one of those lying around 😂 Yes, that's the main idea behind the GSST splits. [...] Happy to introduce splits in a follow up PR if we want to do that.

Heh, I mean this is clearly not a blocker, but I wanted to pin down the exact numbers. We can def land as is from a perf standpoint.

Comment on lines +183 to +195
let host_bytes = CudaDeviceBuffer::new(device_output)
    .copy_to_host(Alignment::new(1))?
    .await?;
let host_bytes = host_bytes.slice(0..total_size);

let (buffers, views) = match_each_integer_ptype!(lens.ptype(), |P| {
    build_views(
        0,
        MAX_BUFFER_LEN,
        host_bytes.into_mut(),
        lens.as_slice::<P>(),
    )
});
Contributor

Can we build the views on the GPU?

Contributor Author

Probably since we already compute the output offsets. It's just a question of modding by i32::MAX to split up the output into a couple of buffers.
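
A rough CPU-side sketch of that split (illustrative names, not the build_views signature), assuming each decoded string fits under the cap: start a new buffer at a string boundary whenever the running size would exceed the cap and rebase each string's offset into its buffer.

// Illustrative sketch: given the (n + 1)-entry output_offsets computed on
// the CPU, assign each string a (buffer index, offset within buffer) pair
// so every buffer stays under `cap` bytes (e.g. i32::MAX for binary views).
fn assign_views(output_offsets: &[u64], cap: u64) -> Vec<(u32, u32)> {
    let mut views = Vec::with_capacity(output_offsets.len().saturating_sub(1));
    let mut buffer_idx = 0u32;
    let mut buffer_start = 0u64;
    for w in output_offsets.windows(2) {
        let (start, end) = (w[0], w[1]);
        if end - buffer_start > cap {
            // Start a new buffer at this string's boundary.
            buffer_idx += 1;
            buffer_start = start;
        }
        views.push((buffer_idx, (start - buffer_start) as u32));
    }
    views
}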

Labels

changelog/feature (A new feature)


4 participants